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The gapped local alignment score of two random sequences fol- 
lows a Gumbel distribution. If computers could estimate the parame- 
ters of the Gumbel distribution within one second, the use of arbitrary 
alignment scoring schemes could increase the sensitivity of searching 
biological sequence databases over the web. Accordingly, this article 
gives a novel equation for the scale parameter of the relevant Gum- 
bel distribution. We speculate that the equation is exact, although 
present numerical evidence is limited. The equation involves ascend- 
ing ladder variates in the global alignment of random sequences. In 
global alignment simulations, the ladder variates yield stopping times 
specifying random sequence lengths. Because of the random lengths, 
and because our trial distribution for importance sampling occurs 
on a different sample space from our target distribution, our study 
led to a mapping theorem, which led naturally in turn to an effi- 
cient dynamic programming algorithm for the importance sampling 
weights. Numerical studies using several popular alignment scoring 
schemes then examined the efficiency and accuracy of the resulting 
simulations. 

1. Introduction. Sequence alignment is an indispensable tool in mod- 
ern molecular biology. As an example, BLAST [2, 3, 18] (the Basic Local 
Alignment Search Tool, http : // www . ncbi . nlm . nih . gov/BLAST/) , a popu- 
lar sequence alignment program, receives about 2.89 submissions per sec- 
ond over the Internet. Currently, BLAST users can choose among only 5 
standard alignment scoring systems, because BLAST p- values must be pre- 
computed with simulations that take about 2 days for the required p-value 
accuracies. Moreover, adjustments for unusual amino acid compositions are 
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essential in protein database searches [33], and in that application, compu- 
tational speed demands that the corresponding p-values be calculated with 
crude, relatively inaccurate approximations [3]. Accordingly, for more than 
a decade, much research has been directed at estimating BLAST p- values in 
real time (i.e., in less than 1 sec) [7, 24, 26, 29], so that BLAST might use 
arbitrary alignment scoring systems. 

Several studies have used importance sampling to estimate the BLAST 
p- value [7, 9, 26]. To describe importance sampling briefly, let E denote the 
expectation for some "target distribution" P, let Q be any distribution, and 
consider the equation 



A computer can draw samples tJi (i = 1, . . . , r) from the "trial distribu- 
tion" Q to estimate the expectation: EX » r _1 Ya=i X(uJi)[dF(uJi) /dQ(ui)] . 
The name "importance sampling" derives from the fact that the subsets 
of the sample space where X is large dominate contributions to EX. By 
focusing sampling on the "important" subsets, judicious choice of the trial 
distribution Q can reduce the effort required to estimate EX. In importance 
sampling, the likelihood ratio dF(u>)/dQ(uj) is often called the "importance 
sampling weight" (or simply, the "weight") of the sample u. 

A Monte Carlo technique called "sequential importance sampling" can 
substantially increase the statistical efficiency of importance sampling by 
generating samples from Q incrementally and exploiting the information 
gained during the increments to guide further increments. Although se- 
quences might seem an especially natural domain for sequential sampling, 
most simulation studies for BLAST p-values have used sequences of fixed 
length. In contrast, our study involves sequences of random length. 

Here, as in several other importance sampling studies [7, 9, 26, 34], hid- 
den Markov models generate a trial distribution Q of random alignments 
between two sequences, where the sequences have a target distribution P. 
The other studies gloss over the fact that their trial and target distributions 
occur on different sample spaces, such as alignments and sequences. The 
other studies used sequences of fixed lengths, however, where a relatively 
simple formula for the weight dF/dQ pertains. For the sequences of random 
length in this paper, however, the stopping rules for sequential sampling 
complicate formulas for dF/dQ. Accordingly, the Appendix gives a general 
mapping theorem giving formulas for the weights dF/dQ. when each sam- 
ple from P corresponds to many different samples from Q. (In the present 
article, e.g., each pair of random sequences corresponds to many possible 
random alignments.) In addition to the mapping theorem, we also develop 
several other techniques specifically tailored to speeding the estimation of 
the BLAST p-value. 
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The organization of this article follows. Section 2 on background and no- 
tation is divided into 4 subsections containing: (1) a friendly introduction 
to sequence alignment and its notation; (2) a brief self-contained descrip- 
tion of the algorithm for calculating global alignment scores; (3) a technical 
summary of previous research on estimating the BLAST p-value introduc- 
ing our importance sampling methods; and (4) a heuristic model for random 
sequence alignment using Markov additive processes. Section 3 on Methods 
is also divided into 4 subsections containing: (1) a novel formula for the rel- 
evant Gumbel scale parameter A; (2) a Markov chain model for simulating 
sequence alignments (borrowed directly from a previous study [34] , but used 
here with a stopping time); (3) a dynamic programming algorithm for calcu- 
lating the importance sampling weights in the presence of a stopping time; 
and (4) formulas for the simulation errors. Section 4 then gives numerical 
results for the estimation of A under 5 popular alignment scoring schemes. 
Finally, Section 5 is our Discussion. 

2. Background and notation. 

2.1. Sequence alignment and its notation. Let A. = A\A2--- and B = 
B1B2 ■ ■ ■ be two semi-infinite sequences drawn from a finite alphabet £, for 
example, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} (the amino 
acid alphabet) or {A, C,G,T} (the nucleotide alphabet). Let s:£ x £ i— > R 
denote a "scoring matrix." In database applications, s(a,b) quantifies the 
similarity between a and 6, for example, the so-called "PAM" (point ac- 
cepted mutation) and "BLOSUM" (block sum) scoring matrices can quan- 
tify evolutionary similarity between two amino acids [11, 16]. 

The alignment graph Ta.b of the sequence-pair (A, B) is a directed, 
weighted lattice graph in two dimensions, as follows. The vertices v of Fa,b 
are nonnegative integer points (i,j). (Below, ":=" denotes a definition, e.g., 
the natural numbers are N := {1,2, 3, . . .}. Throughout the article, i,j, k, m, n 
and g are integers.) Three sets of directed edges e come out of each vertex 
v = (i,j): northward, northeastward and eastward (see Figure 1). One north- 
eastward edge goes into v = (i + 1, j + 1) with weight s[e] = s(Ai + ±, -Bj+i). For 
each g > 0, one eastward edge goes into v = (i+g,j) and one northward edge 
goes into v = + g); both are assigned the same weight s[e] = —w g < 0. 
The deterministic function w:Nh (0, oo] is called the "gap penalty." (The 
value w g = oo is explicitly permitted.) This article focuses on affine gap 
penalties w g = Ao + A\g (Ao, Ai > 0), which are typical in BLAST sequence 
alignments. Together, the scoring matrix s(a, b) and the gap penalty w g con- 
stitute the "alignment parameters." 

A (directed) path tt = (^o, e\,v\, e%, . . . , e^, i^) in Ta,b is a finite alternat- 
ing sequence of vertices and edges that starts and ends with a vertex. For 
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Fig. 1. Gapped global alignment scores and the corresponding directed paths for two 
subsequences A[l, 10] = TACTAGCGCA and B[l,9] = ACGGTAGAT, drawn from the nu- 
cleotide alphabet {A,C,G,T}. Figure 1 uses a nucleotide scoring matrix, where s(a,b) = 5 
if a = b and —4 otherwise, and the affine gap penalty w g = 3 + 2g. The vertex is in the 
northeast corner of the cell with the origin (0,0) at the southwest corner of Figure 1. 

The cell (i,j) displays the global score Sij, calculated from (2.2). The optimal global path 
ending at the point (10,8) , for example, consists of 12 edges, in order: 1 east of length 1,2 
northeast, 1 north of length 2, 3 northeast, 1 east of length 3, and 1 northeast. The optimal 
global score 5io,8 = —5 + 5 + 5 — 7 + 5 + 5 + 5 — 9 + 5 = 9 is the sum of the corresponding 
edges and represents the path of greatest weight starting at (0,0) and ending at (10,8). 
The corresponding optimal global alignment of the subsequences A[l, 10] and B[l,9] is 

TAC--TAGCGCA 

-ACGGTAG A. 

The edge maxima are Mi = -4, M% = 0, M 3 = 5, Ma = 1, Mb = 3, M 6 = 8, M 7 = 13, Af 8 = 9, 
Mg = 6. The shading and the double lines indicate squares where a vertex (sur- 
rounded by double lines) generated an SALE (3{k). The SALE scores are M^m = 
M:i — 5,4/^(2) = Me = 8,4^(3) = M7 — 13; and the global maximum M for A and B is 
no less than 13, the largest global score shown. 

each i = 1,2, . .. ,k, the directed edge ej comes out of vertex Vi-± and goes 
into vertex i>j. We say that the path tt starts at and ends at Vk- 

Denote finite subsequences of the sequence A by A[i, m] = AiAi + i • • • A m . 
Every gapped alignment of the subsequences A[i,m] and B[j, n] corresponds 
to exactly one path that starts at vq = (i — 1, j ' — 1) and ends at Vk = (m, n) 
(see Figure 1). The alignment's score is the "path weight" := J2i=i s i e i\- 

Define the "global score" Sij := max^S^, where the maximum is taken 
over all paths ir starting at vq = (0,0) and ending at = The paths tt 
starting at vo, ending at Vk, and having weight S n = Sij are "optimal global 
paths" and correspond to "optimal global alignments" between A[l,z] and 
B[l,j]. Define the "edge maximum" M n : — max{maxo<i<n Si^ n , maxo<j< n . S n 
and the "global maximum" M :=sup n>0 M n . (The single subscript in M n 
indicates that the variate corresponds to a square [0, n] x [0, n] , rather than a 
general rectangle [0,m] x [0,n].) Define the "strict ascending ladder epochs" 
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(SALEs) in the sequence (M n ): let /3(0) := and [3{k + 1) := min{n > 
/3(k) : M n > Mp(k)}, where min0 := oo. We call M^ k ) the "fcth SALE score." 

Define also the "local score" Sij '■= max,,- S n , where the maximum is taken 
over all paths it ending at Vk = regardless of their starting point. Define 
the "local maximum" M mt1l := maxo<j<m I o<j<n The paths ir ending 
at Vk = with local score S n — ^ij — M m n are "optimal local paths' 
corresponding to the "optimal local alignments" between subsequences of 
A[l,m] and B[l,ra]. 

Now, the following "independent letters" model introduces randomness. 
Choose each letter in the sequence A and B randomly and independently 
from the alphabet £ according to fixed probability distributions {p a : a G £} 
and {p' b :b£L £}. (Although this article permits the distributions {p a } and 
{p' b } to be different, in applications they are usually the same.) Throughout 
the paper, the probability and expectation for the independent letters model 
are denoted by P and E. 

Let r = Ta,b denote the random alignment graph of the sequence-pair 
(A,B). In the appropriate limit, if the alignment parameters are in the so- 
called "logarithmic phase" [6, 12] (i.e., if the optimal global alignment score 
of long random sequences has a negative score) , the random local maximum 
M mtn follows an approximate Gumbel extreme value distribution with "scale 
parameter" A and "pre-factor" K [1, 14], 

(2.1) P(M m , n > y) w l-exp[-Kmnexp(-\y)]. 

2.2. The dynamic programming algorithm for global sequence alignment. 
For affine gaps w g = Ao + Ai<7, the global score Sij is calculated with the 
recursion 

(2.2) Sij = max{S i -ij-i,I i -i t j-i, A-ij-i} + 
where 

Iij = max{S' iiJ „i - A - A 1 ,I i j_i - Ai, Aj-i - A - Ai}, 

Dij = max{5j_ij — Ao — Ai, A-ij — Ai} and boundary conditions 6*0,0 = 
0,-fo,o = A),o = — co, Dgfl = Io,g = — Ao — Aig,S g fi = Sq i9 = I 9i o = Dq j9 = 
—00 for g > [15]. The three array names, S, I, and D, are mnemonics for 
"substitution," "insertion" and "deletion." If "A" denotes a gap character, 
the corresponding alignment letter-pairs (a, 6), (A, b) and (a, A) correspond 
to the operations for editing sequence A into sequence B [30]. 

2.3. Previous methods for estimating the BLAST p-value. If w g = 00 
identically, so northward and eastward (gap) edges are disallowed in an 
optimal alignment path, a rigorous proof of (2.1) yields analytic formulas 
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for the Gumbel parameters A and K [12]. For gapped local alignment, rig- 
orous results are sparse, although some approximate analytical studies are 
extant [21, 22, 27, 29]. The prevailing approach therefore estimates A and 
K from simulations [4, 31]. Because A is an exponential rate, it dominates 
K's contribution to the BLAST p-value. Most studies therefore (including 
the present one) have focused on A. (Note, however, some recent progress 
on the real-time estimation of K [26].) Typically, current applications re- 
quire a 1-4% relative error in A; 10-20%, in K [4]. The characteristics of the 
relevant sequence database determine the actual accuracies required, how- 
ever, making approximations with controlled error and of arbitrary accuracy 
extremely desirable in practice. 

Storey and Siegmund [29] approximate A (with neither controlled errors 
nor arbitrary accuracy) as 

(2.3) A « A* - 2(/i*)- 1 Ae- A * A 7(e A * Al - 1), 

where J2( a ,b) PaPb exp[A*s(a, b)] = 1 [so A* is the so-called "ungapped lambda," 
for A(<?) = oo] and /i* := J2(a,b) s (°> ^)PaKexp[A*s(a, b)]. In (2.3), A is an 
upper bound for an infinite sequence of constants defined in terms of gap 
lengths in a random alignment. 

Many other studies have used local alignment simulations to estimate 
BLAST p-values, for example, Chan [9] used importance sampling and a 
mixture distribution. Some rigorous results [28] are also extant for the so- 
called "island method" [31, 32], which yields maximum likelihood estimates 
of A and K from a Poisson process associated with local alignments exceed- 
ing a threshold score [4, 23]. 

Large deviations arguments [6, 35] support the common belief that global 
alignment can estimate A for local alignment through the equation A = 
— lim^oo y" 1 lnP{M > y}. For a fixed error, global alignment typically re- 
quires less computational effort than local alignment. For example, one early 
study [34] used importance sampling based on trial distributions Q from a 
hidden Markov model. 

The study demonstrated that the global alignment equation E[exp(A,S n ,n)] = 
1 estimated A with only 0(n~ l ) error [7]. (Recall that "E" denotes the 
expectation corresponding to the random letters model.) The equation 
E[exp(AM m )] = E[exp(AM n )] (m ^ n), suggested by heuristic modeling with 
Markov additive processes (MAPs) [5, 10], improved the error substantially, 
to 0{e n ) [24]. 

The next subsection shows how the MAP heuristic can improve the effi- 
ciency of importance sampling even further, with its renewal structure. The 
next subsection gives the relevant parts of the MAP heuristic. 
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2.4. The Markov additive process heuristic. The rigorous theory of MAPs 
appears elsewhere [5, 10]. Because the MAP heuristics given below parallel 
a previous publication [24], we present only informal essentials. 

Consider a finite Markov-chain state-space 3, containing #3 elements. 
Without loss of generality, 3 = {1, . . . , #3}- Until further notice, all vectors 
are row vectors of dimension #3; all matrices, of dimension (#3) x (#3). 
A MAP can be defined in terms of a time-homogenous Markov chain (MC) 
(Jn S = 0, 1,...) and a (#3) X (#3) matrix of real random variates 
Let the MC have transition matrix P = so pij = P( J n = 

j\Jn—i = *)■ Let the stationary distribution of the MC be 7r, assumed strictly 
positive and satisfying both 7rP = 7r and nl 1 = 1, where 1* denotes the 
(#3) x 1 column vector whose elements are all 1. 

As usual, let P-y and E 7 be the probability measure and expectation cor- 
responding to an initial state Jo with distribution 7; Pj and Ej, to an initial 
state Jo = i; and P,,- and E n , to an initial state in the equilibrium distribu- 
tion 7T. 

Run the MC (J n ), and take its succession of states as given. Consider the 
following sequence (Y n sR:n = 0,l,...) of random variates. Define lo := 0. 
For n = 1, 2, . . . , let the (Y n ) be conditionally independent, with distributions 
determined by the transition J n _i — > J n of the Markov chain as follows. If 
J n _i = i and J n = j, the value of Y n is chosen randomly from the distribution 
of Zij. (Thus, if J m -i = J n -i = i and J m = J n = j,Y m and Y n share the 
distribution of Zij, although independence permits randomness to give them 
different values.) 

The random variates of central interest are the sums T n = J2m=o ( n 
i). I . . . .) and the maximum M := max n >oT n . To exclude trivial distributions 
for M (i.e., M = a.s. and M = 00 a.s.), make two assumptions: (1) E^Yi < 
0; and (2) there is some m and state i such that 

(2.4) Pi{min{T fc : k = 1, . . . , m} > 0; J m = t, Jj ^ i for j = 1, . . . , m - 1} > 0. 

Consider the sequence (T n ), its SALEs f3(0) := and (3(k + 1) := min{n > 
(3{k):T n > Tpr k \}, and its SALE scores Tmja. For brevity, let (3 := 
Note that M = Twfy for some k € {0, 1,. . .}. In a MAP, (Jptk)^f3(k)) forms 
a defective Markov renewal process. 

Now, define the matrix Lg> := ||Ej[exp(0T^); Jp =j,j3< 00] |. The Perron- 
Frobenius theorem [5], page 25, shows that L# has a strictly dominant eigen- 
value p(9) > [i.e., p(6*) is the unique eigenvalue of greatest absolute value]. 
Moreover, p(9) is a convex function [19], and because Lo is substochastic, 
p(0) < 1. The two assumptions above (2.4) ensure that M := max n >oT n has 
a nontrivial distribution and that p(X) = 1 for some unique A > 0. 

The notation intentionally suggests a heuristic analogy between MAPs 
and global alignment. Identify the Markov chain states J n in the MAP with 
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the rectangle [0, n] x [0, n] of Ta.Bj and identify the sum T n in the MAP with 
the edge maximum M n in global alignment. In the following, therefore, the 
identification leads to M n replacing T n in the MAP formulas. In particular, 
the MAP heuristic identifies the Gumbel scale parameter in (2.1) with the 
root A > of the equation p(X) = 1. Although the heuristic analogy between 
MAPs and global alignment is in no way precise or rigorous, it has produced 
useful results [24]. 

The details of why the MAP heuristic works so well are presently obscure, 
although some additional motivation appears in an heuristic calculation re- 
lated to A [8]. The calculation takes the limit of nested successively wider 
semi-infinite strips, each strip having constant width and propagating it- 
self northeastward in the alignment graph r^B- The successive northeast 
boundaries of the propagation are states in an ergodic MC. MAPs therefore 
might rigorously justify the heuristic calculation. 

3. Methods. 

3.1. A novel equation for A. From the definition of Lg in a MAP, if the 
Markov chain { J n } starts in a state Jo with distribution -y (with M n replac- 
ing T n in the MAP formulas), matrix algebra applied to the concatenation 
of SALEs in a MAP yields 

(3.1) E^eM0M m ); 0(k) < 00] = 7 (L e ) fc l*. 

For a MAP, equation (3.1) is exact; but for global alignment, it has no literal 
meaning. Equation (3.1) has some consequences for the limit k — ► 00, and 
we speculate that the consequences hold, even for global alignment. [Note: 
although the sequence (j3(k)) is a.s. finite, the limits k — > 00 below involve 
no contradiction or approximation, because they are not a.s. limits.] 

Define K k {9) := ln{E T [exp(0M^(^)); P(k) < 00] }• In (3.1), a spectral (eigen- 
value) decomposition of the matrix L# [25] shows that 

(3.2) K k (9) = kln{p(8)} + c + O(e k ), 

where < e < 1 is determined by the magnitude of the subdominant eigen- 
value of Liq, and Co is a constant independent of 9 and k. 

For k' — k > fixed, we can accelerate the convergence in (3.2) as k — > 00 
by differencing 

(3.3) K k ,(9) - K k {9) = (k f - k)ln{p{9)} + 0(e k ). 

Let \ k ',k denote the root of (3.3) after dropping the error term 0{e k ). Be- 
cause p{\) = 1, Taylor approximation around A yields ln{p(\ k i , k )} ~ p'(^)(^k>,k — 
A), so (3.3) becomes 

(3.4) (k'-k)p'(\)(\ klik -\) = 0(e k ), 
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that is, with k' — k fixed, Xk',k converges geometrically to A as the SALE 
index k — > oo. 

The initial state 7 of global alignment has a deterministic distribution, 
namely the origin (0,0). Equation (3.3) for = A therefore becomes 

(3.5) E[exp(AM^ (fe , ) );/3(A ; / ) < 00] =M[exp(XM m );/3(k) < 00] 
after dropping the geometric error 0(e k ). Let \y,k be the root of (3.5). 

3.2. The trial distribution for importance sampling. In (3.5), crude Monte 
Carlo simulation generating random sequence-pairs with the identical let- 
ters model P is inefficient for the following reason. When practical align- 
ment scoring systems are used, ¥{P(k) < 00} < 1 for k > 1. For, example, 
the BLAST defaults (scoring matrix BLOSUM62, gap penalty w g = 11 + g, 
and Robinson-Robinson letter frequencies), P{/3(4) < 00} 0.047, so only 
about 1 in 20 crude Monte Carlo simulations generate a fourth ladder point. 
Empirically in our importance sampling, however, Gumbel parameter esti- 
mation seemed most efficient when the stopping time corresponded to /3(4) 
(see below). 

Importance sampling requires a trial distribution to determine \k\k from 
(3.5). By editing one sequence into another, a Markov chain model borrowed 
directly from a previous study [34] generates random sequence alignments, 
as follows. 

Consider a Markov state space consisting of the set of alignment letter- 
pairs £ 2 , where £ := £ U {A}, "A" being a character representing gaps. 
The ordered pair (A, A) has probability 0, so a succession of Markov states 
corresponds to a global sequence alignment (see Figure 1), that is, to a 
path in the alignment graph Fa,b- Ordered pairs other than (A, A) fall 
into three sets, corresponding to edit operations following (2.2): S := £ x £ 
[substitution, a bioinformatics term implicitly including identical letter-pairs 
(a, a)], I := {A} x £ (insertion); and D := £ x {A} (deletion). The sets S,I 
and D form "atoms" of the MC [13], page 203, as follows. (By definition, 
each atom of a MC is a set of all states with identical outgoing transition 
probabilities.) 

From the set S, the transition probability to (a, b) is ts^Qafi', to (A, b),tsjp' b ; 
and to (a, A),ts,DPa- From the set I, the transition probability to (a, b) is 
ti,SQa,b] to (A, b), tijp' b ; and to (a, A), ti^Pa- From the set D, the transition 
probability to (a, b) is tD,sQa,b', to (A, b),tDjp' b ; and to (a, A), to,DPa- Transi- 
tion probabilities sum to 1, so the following restrictions apply: Sa,&e£ ( /a,b = 
^EbesK = iiEae-ePa = Ms,S + ts,l + ts,D = 1 (transit from the substi- 
tution atom), tD,D + tD,s + £d,i = 1 (transit from the deletion atom) and 
ti.i + ti,s + ti,D = 1 (transit from the insertion atom). Usually in practice, 
the term ti t o — 0, to disallow insertions following a deletion. Our formulas 
retain the term, to exploit the resulting symmetry later. 
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In the terminology of hidden Markov models, S, I, D are hidden Markov 
states, tij for i,j G {S,I,D} are transition probabilities and q a ,b-,p'b-,Pa for 
a, b G £ are emission probabilities from the state S,I,D, respectively. 

As described elsewhere [34] , numerical values for the Markov probabilities 
can be determined from the scores s(a, b) and the gap penalty w g . Note that 
the values are selected for statistical efficiency, although many other values 
also yield unbiased estimates for A in the appropriate limit. 

3.3. Importance sampling weights and stopping times. To establish no- 
tation, and to make connections to the Appendix and its mapping theorem, 
note that the MC above can be supported on a probability space (0,F,Q), 
where each uj = (tt, A, B) G f2 is an ordered triple. Here, tt is an infinite path 
starting at the origin in the alignment graph Ta,b; F is the set generated 
by cylinder sets in £1 (here, cylinder sets essentially consist of some finite 
path and the corresponding pair of subsequences); and Q is the MC proba- 
bility distribution described above, started at the atom S, with expectation 
operator Eq. 

Let N be any stopping time for the sequence (M n : n = 0, 1, . . .) of edge 
maxima for Ta,b (i-e., the sequence {Mo, . . . , M n } determines whether N < 
n or not). Because M n is determined by (A[l, n], B[l, n]), N is also a stopping 
time for the sequence {(A[l, n], B[l, n]) : n = 0, 1, . . .}. The stopping time of 
main interest here is N = (3(k), the fcth ladder index of (M n ), where k > 1 is 
arbitrary. (As further motivation for the mapping theorem in the Appendix, 
other stopping times of possible interest include, for example, N = n, & fixed 
epoch [7], and N = 0(K y ), where (3(K y ) = inf{n : M n > y} is the index of first 
ladder-score outside the interval (0, y).) 

To use the mapping theorem, introduce the probability space ($1", F",P), 
where each uj" = (A, B) G fi" is an ordered pair. Here, A and B are se- 
quences, F" is the set generated by all cylinder sets in £1" (i.e., sets corre- 
sponding to pairs of finite subsequences) and F(A") = Yl\=iPA k Hk=iP'B k > 
if the cylinder set A" corresponds to the subsequence pair (A[l, i], B[l, j]). 
Given N, the theory of stopping times [5], page 414, can be used to con- 
struct a discrete probability space (Q', F',P), where each event uj' G O' is a 
finite-sequence pair uj' = (A[l, N], B[l, N]), F' is the set of all subsets of Q' 

and p(o/)=nSVnKX- 

Let l m>n := = m,j > n} and D mjn := {(i, j) :i > m,j = n}. Define 

the function / :ui i— > uj' , where uj = (tt, A, B) and uj' = (uj' a ,uj' b ) := (A[l, N], 
B[l, N]). Then, uj G / -1 (o/), if and only if the path tt hits the set ljv,iv U Djy jv 
at so that A[l, N] =uj' a and B[1,N] =ui' B (see Figure 2). 

Empirically, our simulations satisfied Q{/3(k) < oo} = 1, and we specu- 
late that our application therefore satisfies the hypothesis QiJ = 1 of the 
Appendix. According to the Appendix, the reciprocal importance sampling 
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weight l/W(u>) = J2u ef- 1 {f(u))} < Q( UJ o)/^f( U} ) depends on the sum over all 
possible Markov chain realizations ujq £ / _1 (u/). Dynamic programming 
computes the sum efficiently, as follows. 

Let the "transition" T represent any element of {S, I, D} [substitution 
(a,i,bj), insertion (A,bj), or deletion (04, A)]. Fix any particular pair (A, B) 
of infinite sequences, which fixes N = (3{k). To set up a recursion for dy- 
namic programming, consider the following set of events E^, defined for 
T G {S, I, D} and min{i,j} < N, and illustrated in Figure 2. Let Efj be 
the event consisting of all u> yielding a path tt whose final transition is 
T and which corresponds to the subsequences: (1) A[l, i] and B[l,j] for 
< i,j < N; (2) A[l,t] and B[1,N) for < N = j < i ; and (3) A[1,N) and 
B[l, j] for < N = i < j. Define Qf d := Q(Ej^) and Q itj := Q?j + Q 1 ^ + Qg. 




Fig. 2. Two examples of alignment path n generated by a Markov chain. As in Figure 

1, the shading and the double lines indicate squares where a vertex (surrounded by double 
lines) generated an SALE. The SALEs determine the stopping time N — /3(3). In Figure 

2, the first SALE is determined by the score at the vertex (3,3); the second SALE, the 
vertex (7,6); the third SALE, the vertex (9,10). Therefore, N = /?(3) = 10. The vertical 
ray \n,n and the horizontal ray Dn,n are indicated by double circles. The lower path n 
(solid line) ends at (N + 2,N) with a final transition to S; the upper path n (long-dashed 
line), at (N,N + 4) with a final transition to D. The closed vertices indicate intersection 
with the square corresponding to ui' — (uj' a ,uj' b ) = (A[l, N], B[l, N]). 
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(Note: in the following, T G {S, I, D} is always a superscript, never an ex- 
ponent.) 

For brevity, let q^j = q^Bj for < i, j < N;q it j = E(ae£) I^Bj for < 
j < N < i; tjij = J2(b££) lA t ,b for < i < N < j; and cjij = 1 otherwise. Let 
Pj = p'b fo r < j < iV; and 1 otherwise. Finally, Let pi = pa { for <i < N; 
and 1 otherwise. Because every path into the vertex comes from one 
of three vertices, each corresponding to a different transition T G {S,I,D}, 

Qij = %j{ts,sQi-i,j-\ + ti,sQi-i,j-i + tD,sQ?-\,j-x)i 

(3.6) Qlj = fyfajQfj-! + hMj-i + tD,iQ?j-i), 

Q?,j = Pi(ts,DQi-i,j + tl,DQi-i,j + tD,DQf-\j) 

with boundary conditions Q$ Q = 1,Qq,o = Qo,o = ®>Qg,0 = Qo, g = Qg,o = 
Qo, g = ° ; Qo,3 = PB 1 ---PB g t S,l(tl,i) 9 ~ 1 and Q? fi = p Al ■ ■ ■ PA g ts,D x 

{tD^) 9 " 1 (g>0). 

Recall that oj = (jr, A, B) G f~ l (u)'), if and only if the path tt hits the set 
Ijv.ArU Dtv,v at (i,j), so that A[l,iV] = to' A and B[l,iV] =Wg. Thus, 

oo oo 

(3.7) X) = ~Qn,n + E + Qnj) + E (Q?iv + QI,n)- 
ue/- 1 ^') j=JV i=JV 

To turn (3.6) into a recursion for importance sampling weights, define Pi := 
PAi--- PA min{i , N} =Pi---Pi and P'j :=p' Bl --- P' Bmin{j>N} =&V Pj , and let : 
Qf,/ {PiP'j) (T G {£, /, -D}). Let rj j = q\^J (piPj). For future reference, define 
r,j := rjj- for < j < N < i and r^. := r,j for < i < N < j. Note that r,j 
is independent of i, and r^. is independent of j. Equation (3.6) yields 

Wfj = r i , j (ts,sWiL 1 j_ 1 + t I)S WUj-i + tDfWE^i), 

(3.8) = tsjWfj.! + tj.jWi-.x + tDjW^-i, 
Wg = t s ,DWf_ hj + t I>D WUj + t D ,DWP ltj 

with boundary conditions W^ = 1, Wqq = W D = 0, W 9 5 = W$ <g = W$ g = 

W g D o = 0, = tsAtij) 9 ' 1 and W g°o = tsA^D) 9 " 1 (o > 0). Because of 
(3.7), the importance sampling weight W := W{oj) satisfies 

1 = Ec„e;-i{/H} QM 
W P/(w) 

(3-9) 



-w s N , N + E (w^- + w$ d ) + E (^5v + K 



N) 
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Because r^j = r 9 j (0 < j < N < i) and r,j = r^., (0 < i < N < j), only a 
finite number of recursions are needed to compute the infinite sums in (3.9), 
as follows. For Te{S,I,D}, define Of := U? N , where U^ n :=£~ n W^-. 

Likewise, define Vj := Vjy ■, where V^ n := X^m^vr Equation (3.9) be- 
comes 

1 

if 

Note that Ufj_i — Ufj = Wjj_ x . To determine Uj^, summation of (3.8) 
for < i < N < j yields 



(3-10) 4- = -W%„ + U% + Ug + Vg + 



(3.11) 



IjS. _ yyS 

""1,3— 1' 



=ts,DU i _ ljj + t I , D U-_ lJ +t DjD uP_ lj . 



Elimination of Ufj for j = A r + 1 and z = 1, . . . , N in the first two equations 
yields 

Uf, N = ri,.(ts,sUf_ hN + ti iS UU, N + tD,sUP ltN ) + Wgvr, 

(3.12) U! N = t SJ Uf >N + tijU! N + t D ,iUP N + Wi >m 

that is, 

(3.13) L// = (1 - tuy^tsjUf + t DJ UP + W? N ), 

Of = ts,DUf_i + -I- to,DUP_i 

with initial values J7 5 = C/ D = and t/^ = (1 - i/,/)" 1 ^^ = (1 - t/,/)" 1 x 
ts,i(tjj) 1 . Compute (3.13) recursively for i = 1, . . . ,N. 
Similarly, reflect through i= j to derive 

Vf = r. d (t s , s Vf-i + t DlS VP x + ti, s V 3 U) + Wgj, 

(3.14) V/ = tsjVf^ + t D j/P_ x + tuVjU, 

yp = (1 - t DJ3 )- x (t SJ3 vf + t ItD v/ + wg d ) 

with initial values V S = Vq =0 and = (l-tD,u) -1 Wjv,o = (l-^,/)) -1 x 
^S,d(Pd,d) • Iterate (3.14) for j = 1,...,N. Substitute the results for 
U§,Ufi,V$, and into (3.10) to compute W. 
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3.4. Error estimates for Xk'.k- Denote the indicator of an event A by 
L4, that is, L4 = 1 if A occurs and otherwise. For a realization uj in the 
simulation, define 

hk,k'{9) '■= hk,k>(8;u) 

(3.15) 

:= exp(eMp {kl) )I\/3(k') < oo] - exp(6M m )I\p(k) < oo] 

and let h' k k , be its derivative with respect to 9. 

Given samples tOi (i = l,...,r) from the trial distribution Q, let W = 
W(ui) denote the corresponding importance sampling weights. Because \ k ',k 
is the M-estimator [17] of the root \k',k oi¥,hk : k'(^k',k) = 0, as r — > oo, \/r{\k',k ~ 
Afc',fc) converges in distribution to the normal distribution with mean and 
variance [17] 

EQ[h(\ k ,, k )W] 2 ^ r-^im^Xk'^Wjui)} 2 

4. Numerical study for Gumbel scale parameter. Table 1 gives our "best 
estimate" A of the Gumbel scale parameter A from (3.5) for each of the 5 
options BLASTP gives users for the alignment scoring scheme. For every 
scheme, estimates A derived from the first to fourth SALEs indicated that A 
generally is biased above the true value A, but that A converged adequately 
by the fourth SALE. The best estimate A (shown in Table 1) is the average 
of 200 independent estimates A, each computed within 1 sec from sequence- 
pairs simulated up to their fourth SALE. For BLOSUM 62 and gap penalty 
w g = 11 + g, the average computation produced 1441 sequence-pairs up to 
their fourth SALE within 1 second. (For results relevant to the other publicly 
available scoring schemes, see Table 1.) The best estimates A derived from 
(3.5) were within the error of the BLASTP values for A. 

Despite having the variance formula in (3.16) in hand, we elected to es- 
timate the standard error s\ directly from the 200 independent estimates 
A. Figure 3 plots the relative error s\/A in each individual A against the 
computation time, where s\ is the standard error of A. It shows that for all 
5 BLASTP online options, (3.5) easily computed A to 1-4% accuracy within 
about 0.5 seconds. 

5. Discussion. This article indicates that the scale parameter A of the 
Gumbel distribution for local alignment of random sequences satisfies (3.5), 
an equation involving the strict ascending ladder-points (SALEs) from global 
alignment, at least approximately. For standard protein scoring systems, in 
fact, simulation error could account for most (if not all) of the observed dif- 
ferences between values of A calculated from (3.5) and values calculated from 
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Table 1 

Best estimates X for the 5 BLASTP alignment scoring schemes. For each scheme, we 
generated 200 estimates X, each within a one-second computation time. The third column 

gives present estimates of X used on the BLAST web page (Stephen Altschul: personal 
communication). The BLAST values are accurate to approximately ±1%. The fourth 
column gives the mean X of our 200 estimates X; the fifth, the standard error of X, which 

can be multiplied by V200 ~ 14 to give the standard error in each X. The sixth column 
gives the average number of sequence-pairs used to estimate each X. The total number of 
sequence-pairs used for X is 200 times average number of sequence-pairs. The last column 

gives the average sequence length required for the fourth SALE used to estimate each X 





Gap 


BLAST 


Best 


Standard 




Average 


Scoring 


penalty 


value 


estimate 


error of 


Average number of 


sequence 


matrix 


Wg 




A 


A 


sequence-pairs 


length 


BLOSUM80 


W + g 


0.299 


0.2998 


0.0001 


2865 


15.85 


BLOSUM62 


11 + g 


0.267 


0.2679 


0.0002 


1441 


27.78 


BLOSUM45 


U + 2g 


0.195 


0.1962 


0.0003 


789 


39.23 


PAM30 


9 + g 


0.294 


0.2956 


0.0001 


3593 


9.20 


PAM70 


W + g 


0.291 


0.2922 


0.0001 


3397 


11.49 



extensive crude Monte Carlo simulations. (The values of A from crude sim- 
ulation have a standard error of about ±1%.) In SALE simulations, (3.5) 
estimated A to 1-4% accuracy within 0.5 second, as required by BLAST 
database searches over the Web. The present study did not tune simula- 
tions much; it relied instead on methods specific to sequence alignment to 
improve estimation. Many general strategies for sequential importance sam- 
pling therefore remain available to speed simulation. Preliminary investiga- 
tions estimating the other Gumbel parameter (the pre-factor K) with SALEs 
are encouraging, so online estimation of the entire Gumbel distribution for 
arbitrary scoring schemes appears imminent, and preliminary computer code 
is already in place. 

APPENDIX: A GENERAL MAPPING THEOREM FOR 
IMPORTANCE SAMPLING 

The following theorem describes an unusual type of Rao-Blackwellization 
[20]. Consider two probability spaces (f2, F,Q) and (Q',F',P), and a F/F'- 
measurable function / :0 i— ► Q' (i.e., f~ 1 F' E F for every F' G F'). Note: / 
is explicitly permitted to be many-to-one. Let P << Q/" 1 on some set H 1 
(i.e., Qf-^-G' = =► FG = for any set G' C H'), so the Radon-Nikodym 
derivative in the second line of (A.l) below exists. Let H := f~ l H' , so for 
every random variate X' on (Q 1 , F'), 

ELY'; H'] := [ X'{J)dF{J) 
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(A.l) 



X'{J)dF(J) . 

, x'm-r dFfiLJ L, ^ QM 



Consider the application of (A.l) to importance sampling with target dis- 
tribution P and trial distribution Q. Assume QH = 1, so H supports Q. In 
our application to global alignment, H = [(3(k) < oo] C ("C" being strict 
inclusion), but we speculate QH = 1. 

In Monte Carlo applications, a discrete sample space H is usually avail- 
able. Accordingly, the following theorem replaces the integrals in (A.l) by 
sums. 

The mapping theorem for importance sampling. Let 

/a rj\ i E^ g/-i{/H} QM 




Fig. 3. Plot of relative errors against computation time (sec). Both axes are in logarith- 
mic scale. Computation time was measured on a 2.99 GHz Pentium® D CPU. Relative er- 
rors for BLOSUM45 with A(g) = U + 2g are shown by ■ BLOSUM62 with A(g) = 11 + g, 
by BLOSUM80 with A(g) = 10 + g, by Q; PAM70 with A(g) = 10 + g, by "; PAM30 
with A(g) = 9 + g, by !. 
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Under the above conditions, r _1 Ya=i[X' f(uJi)W(uJi)] -^K[X';H'] with prob- 
ability 1 and in mean (with respect to Q), as the number of realizations 
r — ► oo. 

The mapping theorem is an easy application of the law of large numbers 
to (A.l). 
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